We consider the problem of estimating the conditional probability of
a label in time $O(\log n)$, where $n$ is the number of possible
labels. We analyze a natural reduction of this problem to a set of
binary regression problems organized in a tree structure, proving a
regret bound that scales with the depth of the tree. Motivated by this
analysis, we propose the first online algorithm which provably
constructs a logarithmic depth tree on the set of labels to solve this
problem. We test the algorithm empirically, showing that it works
successfully on a dataset with roughly $10^6$ labels.



1 Introduction 

The central question in this paper is how to efficiently estimate the
conditional probability of label $y \in \{1, \ldots, n\}$ given an
observation $x \in X$. Virtually all approaches for solving this
problem require $\Omega(n)$ time. A commonly used one-against-all
approach, which tries to predict the probability of label $i$ versus
all other labels, for each $i \in \{1, \ldots, n\}$, requires
$\Omega(n)$ time per training example. Another common $\Omega(n)$
approach is to learn a scoring function $f(y, x)$ and convert it into
a conditional probability estimate according to $f(y, x)/Z(x)$, where
$Z(x) = \sum_i f(i, x)$ is a normalization factor.

The motivation for dealing with the computational difficulty is the
usual one: we want the capability to solve otherwise unsolvable
problems. For example, one of our experiments involves a probabilistic
prediction problem with roughly $10^6$ labels and $10^7$ examples,
where any $\Omega(n)$ solution is intractable.



1.1 Main Results 

In Section 4 we provide the first online supervised learning algorithm
that trains and predicts with $O(\log n)$ computation per example. The
algorithm does not require knowledge of $n$ in advance; it adapts
naturally as new labels are encountered.

The prediction algorithm uses a binary tree where re- 
gressors are used at each node to predict the condi- 
tional probability that the true label is to the left or 
right. The probability of a leaf is estimated as the 
product of the appropriate conditional probability es- 
timates on the path from root to leaf. In our experi- 
ments, we use linear regressors trained via stochastic 
gradient descent. 

The difficult part of this algorithm is constructing the tree itself.
When the number of labels is large, it becomes critical to construct
easily solvable binary problems at the nodes. In Section 4.1 we
introduce a tree-construction rule with two desirable properties.
First, it always results in depth $O(\log n)$. Second, it encourages
natural problems by minimizing expected loss at the nodes. The
technique used in the algorithm is also useful for other prediction
problems such as multiclass classification.

We test the algorithm empirically on two datasets (in Section 4.3),
and find that it both improves performance over naive tree-building
approaches and competes in prediction performance with the common
one-against-all approach, which is exponentially slower.

Finally, we analyze a broader set of logarithmic time probability
estimation methods. In Section 3.1.1 we prove that any tree-based
approach has squared loss bounded by the tree depth squared times the
average squared loss of the node regressors used. In contrast, the
PECOC approach [4] has squared loss bounded by just 4 times the
average squared loss but uses $\Omega(n)$ computation. This suggests a
tradeoff between computation and squared loss multiplier. Section 3.2
describes a $k$-parameterized construction achieving a ratio of
$4 (\log_k n)^2 \left(\frac{k-1}{k}\right)^2$ while using
$O(k \log_k n)$ computation, where $k = 2$ gives the tree approach and
$k = n$ gives PECOC.

1.2 Prior Work 

There are many methods used to solve conditional probability
estimation problems, but very few of them achieve a logarithmic
dependence on $n$. The ones we know are batch-constructed regression
trees such as C4.5 [9], ID3 [7], or Treenet [10], which are both too
slow to consider on datasets of the scale of interest and incapable of
reasonably dealing with new labels appearing over time.

Mnih and Hinton [8] constructed a special purpose tree-based algorithm
for language modeling, which is perhaps the most similar previous
work. The algorithm there is specialized to word prediction and is
substantially slower since it involves many iterations through the
training data. However, the general analysis we provide in Section
3.1.1 applies to their algorithm. We regard the empirical success of
their algorithm as further evidence that tree-based approaches merit
investigation.

1.3 Outline 

Section 3 states and analyzes methods for logarithmic time
probabilistic prediction given a tree structure. Section 4 gives an
algorithm for building the tree structure. The analysis in Section 3
is sufficiently general that it applies to the trees built in
Section 4.

2 Problem Setting 

Given samples from a distribution $P$ over $X \times Y$, where $X$ is
an arbitrary observation space and $Y = \{1, \ldots, n\}$, the goal is
to estimate the conditional probability $P(y \mid x)$ of a label
$y \in Y$ for a new observation $x \in X$.

For an estimator $Q(y \mid x)$ of $P(y \mid x)$, the squared loss of
$Q$ with respect to $P$ is defined as

$$\ell_P(Q) = \mathbb{E}_{(x,y) \sim P} \bigl(P(y \mid x) - Q(y \mid x)\bigr)^2. \tag{1}$$

It is more common to define an observable squared loss where
$P(y \mid x)$ in equation (1) is replaced by 1. We consider regret
with respect to the common definition, since it is well known that the
difference between observable squared loss and the minimum possible
observable squared loss is equal to $\ell_P(Q)$. We therefore use
regret and squared loss interchangeably in this paper.



It is well known that squared loss is a strictly proper scoring rule
[2], thus $\ell_P(Q)$ is uniquely minimized by $Q = P$. Our analysis
focuses on squared loss because it is a bounded proper scoring rule.
The boundedness implies that convergence guarantees hold under weaker
assumptions than for unbounded proper scoring rules such as log loss.

3 Probabilistic Prediction Given a Tree

This section assumes that a tree structure is given, 
and analyzes how to use it for probabilistic logarithmic 
time prediction. 

3.1 Conditional Probability Tree 

Consider a fixed binary tree whose leaves are the $n$ labels. For a
leaf node $y \in Y$, let $T(y)$ be the set of non-leaf nodes on the
path from the root to $y$ in the tree.

Each non-leaf node $i$ is associated with the regression problem of
predicting the probability, under $P$, that the label $y$ of a given
observation $x \in X$ is in the right subtree of $i$, conditioned on
$i \in T(y)$. The following procedure shows how to transform
multiclass examples into binary examples for each non-leaf node in the
tree. Here $\mathrm{right}_i(y)$ is 0 when $y$ is in the left subtree
of node $i$, and 1 otherwise.



Algorithm 1: Conditional Probability Tree Training
(training set S, regression algorithm R)

foreach internal node i do
    Set S_i to the empty set
foreach example (x, y) in S do
    foreach node i in T(y) do
        Add (x, right_i(y)) to S_i
foreach internal node i do
    Train f_i = R(S_i)
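The transformation in Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under assumptions that are not from the paper: the tree is a nested tuple whose leaves are labels, each internal node is identified by the path string leading to it, and `label_paths` and `build_node_datasets` are hypothetical helper names.

```python
from collections import defaultdict

def label_paths(tree, prefix=""):
    """Map each leaf label to its root-to-leaf path ('0' = left, '1' = right)."""
    if not isinstance(tree, tuple):          # a leaf holds a label
        return {tree: prefix}
    left, right = tree
    paths = label_paths(left, prefix + "0")
    paths.update(label_paths(right, prefix + "1"))
    return paths

def build_node_datasets(tree, samples):
    """For each internal node i on the path to y, add (x, right_i(y)) to S_i."""
    paths = label_paths(tree)
    datasets = defaultdict(list)
    for x, y in samples:
        path = paths[y]
        for depth, bit in enumerate(path):
            node = path[:depth]              # internal node id = path prefix
            datasets[node].append((x, int(bit)))
    return dict(datasets)

# Four labels in a balanced tree: ((a, b), (c, d)).
tree = (("a", "b"), ("c", "d"))
S = [("x1", "a"), ("x2", "c"), ("x3", "d")]
node_data = build_node_datasets(tree, S)
```

Each example touches only the nodes on one root-to-leaf path, so the number of binary examples generated per multiclass example equals the depth of its label.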



Given a new observation $x \in X$ and a label $y \in Y$, we use the
learned binary regressors $f_i$ to estimate $P(y \mid x)$. Letting
$Q_i(1 \mid x) = f_i(x)$ and $Q_i(0 \mid x) = 1 - f_i(x)$, we define
the estimate

$$Q(y \mid x) = \prod_{i \in T(y)} Q_i(\mathrm{right}_i(y) \mid x). \tag{2}$$
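As a small worked example of equation (2), the following Python sketch multiplies node estimates along a root-to-leaf path. The path-string encoding of nodes, the `predict` helper, and the constant regressors are illustrative assumptions rather than anything specified in the paper.

```python
def predict(path, regressors, x):
    """Estimate Q(y | x) as the product of node estimates along the path to y.

    path: string of '0'/'1' steps from the root to the leaf of y.
    regressors: dict mapping a node id (path prefix) to f_i, where f_i(x)
                estimates the probability of stepping right at node i.
    """
    q = 1.0
    for depth, bit in enumerate(path):
        f = regressors[path[:depth]](x)
        q *= f if bit == "1" else 1.0 - f
    return q

# Hypothetical constant regressors for a balanced tree over four labels.
regressors = {"": lambda x: 0.7, "0": lambda x: 0.2, "1": lambda x: 0.5}
estimates = {p: predict(p, regressors, x=None) for p in ("00", "01", "10", "11")}
```

Because every internal node splits its mass between its two children, the leaf estimates always sum to one, whatever the node regressors output.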

3.1.1 Analysis of the Conditional Probability Tree

Algorithm 1 implicitly defines a distribution $P_i$ over
$X \times \{0, 1\}$ induced at node $i$: a sample from $P_i$ is
obtained by drawing $(x, y)$ according to $P$ until $i \in T(y)$, and
outputting $(x, \mathrm{right}_i(y))$ (although we never explicitly
perform this sampling). The following theorem bounds the squared loss
of $Q$ given the average squared loss of the binary regressors.

Theorem 1. For any distribution $P$, any set of node estimators $Q_i$,
and any pair $(x, y)$, with $Q$ given by equation (2),

$$\bigl(Q(y \mid x) - P(y \mid x)\bigr)^2 \le d^2 \, \mathbb{E}_i \bigl(Q_i(\mathrm{right}_i(y) \mid x) - P_i(\mathrm{right}_i(y) \mid x)\bigr)^2,$$

where $d = |T(y)|$ and the expectation is over $i$ chosen uniformly at
random from $T(y)$.

Proof. We use Lemma 2. Using the notation of its proof, observe that

$$\Bigl(\sum_i |q_i - p_i|\Bigr)^2 = d^2 \bigl(\mathbb{E}_i |q_i - p_i|\bigr)^2 \le d^2 \, \mathbb{E}_i \bigl(|q_i - p_i|^2\bigr)$$

using Jensen's inequality. □

Most of the theorem is proved with the following core lemma. For a
node $i$ on the path from the root to label $y$, define
$p_i = P_i(\mathrm{right}_i(y) \mid x)$, the conditional probability
that the label is consistent with the next step from $i$ given that
all previous steps are consistent. Similarly define
$q_i = Q_i(\mathrm{right}_i(y) \mid x)$.

Lemma 2. For any distribution $P$, any set of node estimators $Q_i$,
and any pair $(x, y)$, with $Q$ given by equation (2),

$$|Q(y \mid x) - P(y \mid x)| \le \sum_{i \in T(y)} |q_i - p_i| \prod_{j \ne i} \max\{p_j, q_j\} \le \sum_{i \in T(y)} |q_i - p_i|.$$

The last inequality is the simplest — it says the differ- 
ences in errors add. However, the quantity after the 
first inequality can be much tighter. 

Proof. We first note that

$$|Q(y \mid x) - P(y \mid x)| \le \prod_i \max\{p_i, q_i\} - \prod_i \min\{p_i, q_i\}$$

since $\prod_i \max\{p_i, q_i\} \ge \max\{Q(y \mid x), P(y \mid x)\}$
and $\prod_i \min\{p_i, q_i\} \le \min\{Q(y \mid x), P(y \mid x)\}$.

We use a geometric argument. With $\prod_i \min\{p_i, q_i\}$ defining
the volume of one "corner" of a box with sides $\max\{p_i, q_i\}$,
slabs of volume $|q_i - p_i| \prod_{j \ne i} \max\{p_j, q_j\}$ fill in
the remaining volume (with overlap). Consequently, we can bound the
difference in volume as

$$\prod_i \max\{p_i, q_i\} - \prod_i \min\{p_i, q_i\} \le \sum_i |q_i - p_i| \prod_{j \ne i} \max\{p_j, q_j\} \le \sum_i |q_i - p_i|$$

since all $p_j$ and $q_j$ are bounded by 1. □

As suggested by the proof, the lemma's bound can be asymptotically
tight. If all $p_i$ are equal to some $p$ and all $|q_i - p_i|$ are
small, the left side is approximately
$p^{d-1} \sum_i |q_i - p_i| = d \, p^{d-1} \, \mathbb{E}_i |q_i - p_i|$,
a factor $p^{d-1}$ times the right side.
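The chain of inequalities in Lemma 2, and the squared form used in Theorem 1, can be checked numerically. The sketch below draws random per-node probabilities; the helper name `lemma2_terms`, the seed, and the range of depths are arbitrary illustrative choices.

```python
import math
import random

def lemma2_terms(p, q):
    """Return (|Q - P|, tighter bound, simple bound) for one root-to-leaf path."""
    gap = abs(math.prod(q) - math.prod(p))
    hi = [max(a, b) for a, b in zip(p, q)]
    tight = sum(
        abs(q[i] - p[i]) * math.prod(hi[:i] + hi[i + 1:])
        for i in range(len(p))
    )
    simple = sum(abs(b - a) for a, b in zip(p, q))
    return gap, tight, simple

rng = random.Random(0)
results = []
for _ in range(1000):
    d = rng.randint(1, 12)                    # path length |T(y)|
    p = [rng.random() for _ in range(d)]      # true conditional probabilities
    q = [rng.random() for _ in range(d)]      # estimated conditional probabilities
    gap, tight, simple = lemma2_terms(p, q)
    mean_sq = sum((b - a) ** 2 for a, b in zip(p, q)) / d
    results.append((gap, tight, simple, d * d * mean_sq))
```

In every trial the first entry should be no larger than the second, the second no larger than the third, and the square of the first no larger than the fourth, which is the right-hand side of Theorem 1.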

3.2 Conditional PECOC 

The conditional probability tree is as computationally 
tractable as we could hope for, but is not as robust 
as we could hope for. For example, the PECOC ap- 
proach [4] yields a squared loss multiplier of 4 inde- 
pendent of the number of labels. Is there an approach 
more robust than the tree, but requiring less compu- 
tation than PECOC? 

We provide a construction which trades off between the extremes of
PECOC and the conditional probability tree. The essential idea is to
shift from a binary tree to a $k$-way tree, where PECOC with $k - 1$
regressors is used at each node in the tree to estimate the
probability of any child conditioned on reaching the node. For
simplicity, we assume that $k$ is a power of 2, and $n$ is a power of
$k$.

Theorem 3. Pick a $k$-way tree on the set of $n$ labels, where $k$ is
a power of 2. For all distributions $P$ and all sets of learned
regressors, with $k - 1$ regressors per node of the tree, for all
pairs $(x, y)$,

$$\bigl(Q(y \mid x) - P(y \mid x)\bigr)^2 \le 4 (\log_k n)^2 \left(\frac{k-1}{k}\right)^2 \epsilon^2,$$

where $\epsilon^2$ is the average squared loss of the
$(k - 1) \log_k n$ queried regressors.

Proof. The proof is by composition of two lemmas. 

In each node of the tree, Lemma 4 bounds the power of the adversary to
disturb the probability estimate as a function of the adversary's
regret. Similarly, Lemma 2 bounds the power of the adversary to induce
an overall misestimate as a function of the adversary's power to
disturb the estimates within each node on the path.

□ 



The curve below illustrates how the construction 
trades off computation for a better regret bound as 
a function of k. 
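The tradeoff can also be tabulated directly. The sketch below assumes the multiplier in Theorem 3 has the form $4 (\log_k n)^2 ((k-1)/k)^2$, which reduces to the squared depth of Theorem 1 at $k = 2$ and to roughly 4, the PECOC constant, at $k = n$; the function name `tradeoff` and the choice $n = 2^{12}$ are illustrative.

```python
import math

def tradeoff(n, k):
    """Return (queried regressors, squared-loss multiplier) for a k-way tree."""
    depth = round(math.log(n, k))             # assumes n is a power of k
    assert k ** depth == n
    computation = (k - 1) * depth             # k - 1 regressors per node on the path
    multiplier = 4 * depth ** 2 * ((k - 1) / k) ** 2
    return computation, multiplier

n = 2 ** 12
table = {k: tradeoff(n, k) for k in (2, 4, 8, 16, 64, 4096)}
for k, (cost, mult) in table.items():
    print(f"k={k:5d}  regressors={cost:5d}  multiplier={mult:8.2f}")
```

As $k$ grows, the number of queried regressors rises from $\log_2 n$ towards $n - 1$ while the multiplier falls from $(\log_2 n)^2$ towards 4.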



[Figure: regret bound as a function of k; horizontal axis: computation
(number of queried regressors).]

In equation (4) of Section 3.2.1, the expectation is over $i$ drawn
uniformly from the rows of $C$. The reason for this formula is
clarified by the proof of Lemma 4.

To complete the proof of Theorem 3 we describe the PECOC construction
in Section 3.2.1 and prove Lemma 4 in Section 3.2.2.

3.2.1 The PECOC Construction 

The PECOC construction is defined by a binary matrix $C$ with each
column corresponding to a label and each row defining a regression
problem. The regression problem corresponding to row $i$ is to predict
the probability given $x$ that the correct label is in the subset

$$Y_i = \{y \in Y : C(i, y) = 1\}. \tag{3}$$



We use an explicit family of Hadamard codes given by the recursive
formula

$$C_0 = \begin{pmatrix} 1 \end{pmatrix}, \qquad C_{t+1} = \begin{pmatrix} C_t & C_t \\ C_t & 1 - C_t \end{pmatrix}.$$

We use a matrix $C_t$ with $2^{t-1} < n \le 2^t$, noting that its size
$2^t$ is less than $2n$; if $2^t > n$ we simply add dummy labels. We
henceforth assume without loss of generality that $n$ is a power of 2.
We train PECOC according to the following algorithm.

Algorithm 2: PECOC Training (training set S, regression algorithm R)

foreach row i of C do
    Let S_i = {(x, C(i, y)) : (x, y) in S}
    Train r_i = R(S_i)



Given a new observation $x \in X$ and a label $y \in Y$, PECOC uses
the binary regressors $r_i$ learned in Algorithm 2 to estimate
$P(y \mid x)$ using the formula

$$\mathrm{pecoc}(y \mid x) = 2 \, \mathbb{E}_i \bigl[ C(i, y) \, r_i(x) + (1 - C(i, y)) (1 - r_i(x)) \bigr] - 1. \tag{4}$$
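The following Python sketch builds the code matrix recursively and evaluates formula (4), showing that perfect subset regressors recover $P(y \mid x)$ exactly. It assumes the recursion starts from the 1 x 1 matrix $C_0 = (1)$; the function names and the example distribution are made up for illustration.

```python
def hadamard_code(t):
    """Build C_t from C_{t+1} = [[C_t, C_t], [C_t, 1 - C_t]], assuming C_0 = [[1]]."""
    C = [[1]]
    for _ in range(t):
        top = [row + row for row in C]
        bottom = [row + [1 - c for c in row] for row in C]
        C = top + bottom
    return C

def pecoc(C, r, y):
    """Formula (4): 2 E_i[C(i,y) r_i + (1 - C(i,y)) (1 - r_i)] - 1."""
    terms = [C[i][y] * r[i] + (1 - C[i][y]) * (1 - r[i]) for i in range(len(C))]
    return 2 * sum(terms) / len(terms) - 1

C = hadamard_code(3)                                    # 8 x 8 code for n = 8 labels
P = [0.30, 0.05, 0.10, 0.05, 0.20, 0.10, 0.15, 0.05]    # made-up P(y | x)
# Perfect subset regressors: r_i = P(y in Y_i | x).
r = [sum(P[v] for v in range(8) if C[i][v] == 1) for i in range(8)]
recovered = [pecoc(C, r, y) for y in range(8)]
```

The recovery works because any two distinct columns of the code agree in exactly half of the rows, so every label other than $y$ contributes half of its probability mass to the average in (4).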



3.2.2 A Careful PECOC analysis 

The following lemma gives the precise regret bound, which follows from
the analysis in [4] but is tighter for small values of $n$ than the
bound stated there.

Lemma 4. (PECOC regret [4]) For all distributions $P$ and all sets of
regressors $r_i$ (as defined in Algorithm 2), for all $x \in X$ and
$y \in Y$,

$$\bigl(\mathrm{pecoc}(y \mid x) - P(y \mid x)\bigr)^2 \le 4 \left(\frac{n-1}{n}\right)^2 \mathbb{E}_i \bigl(r_i(x) - P(y \in Y_i \mid x)\bigr)^2,$$

where $Y_i$ is the subset defined by row $i$ per (3).



Proof. Since the code and the prediction algorithm are symmetric with
respect to set inclusion, we can assume without loss of generality
that $y$ is in every subset (complementing all subsets not containing
$y$). Thus every entry $C(i, y) = 1$, and by (4) the PECOC output
estimate of $P(y \mid x)$ is

$$\mathrm{pecoc}(y \mid x) = \frac{2}{n} \sum_{i=1}^{n} r_i(x) - 1.$$

Let $\bar{r}_i(x) = P(y \in Y_i \mid x) = \sum_{v \in Y_i} P(v \mid x)$
denote the perfect subset estimators, and write
$r_i(x) = \bar{r}_i(x) + e_i$. By the nature of $C$, the label $y$
under consideration occurs in every subset, and every other label
$v \ne y$ in exactly half the subsets, so that

\begin{align*}
\sum_i r_i(x) &= \sum_i \Bigl( \sum_{v \in Y_i} P(v \mid x) + e_i \Bigr) \\
&= \sum_v \sum_{i : Y_i \ni v} P(v \mid x) + \sum_i e_i \\
&= \sum_{v \ne y} \frac{n}{2} P(v \mid x) + n P(y \mid x) + \sum_i e_i \\
&= \frac{n}{2} \bigl(1 + P(y \mid x)\bigr) + \sum_i e_i.
\end{align*}

This gives
$\mathrm{pecoc}(y \mid x) = P(y \mid x) + \frac{2}{n} \sum_i e_i$, for
squared loss
$(\mathrm{pecoc}(y \mid x) - P(y \mid x))^2 = \bigl(\frac{2}{n} \sum_i e_i\bigr)^2$.
One of the subsets, say the first, is trivial (it includes all
labels), and for it we stipulate the true probability $r_1 = 1$, so
$e_1 = 0$. Letting $\mathbb{E}_i e_i$ denote the mean of the other
$n - 1$ errors $e_i$, the squared loss is
$\bigl(2 \frac{n-1}{n} \mathbb{E}_i e_i\bigr)^2$, which by Jensen's
inequality is at most
$4 \bigl(\frac{n-1}{n}\bigr)^2 \mathbb{E}_i e_i^2$, establishing the
lemma. □
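The bound can be sanity-checked numerically with noisy subset regressors. This sketch again assumes the code starts from $C_0 = (1)$, keeps the trivial first row error-free as in the proof, and compares each label's squared error against $4((n-1)/n)^2$ times the mean squared error of the other $n - 1$ regressors; the noise range and seed are arbitrary.

```python
import random

def hadamard_code(t):
    """Recursive code matrix, assuming the base case C_0 = [[1]]."""
    C = [[1]]
    for _ in range(t):
        C = [row + row for row in C] + [row + [1 - c for c in row] for row in C]
    return C

def pecoc(C, r, y):
    """PECOC estimate: 2 E_i[C(i,y) r_i + (1 - C(i,y)) (1 - r_i)] - 1."""
    terms = [C[i][y] * r[i] + (1 - C[i][y]) * (1 - r[i]) for i in range(len(C))]
    return 2 * sum(terms) / len(terms) - 1

rng = random.Random(1)
n = 8
C = hadamard_code(3)
raw = [rng.random() for _ in range(n)]
P = [v / sum(raw) for v in raw]                      # random distribution over labels
# e_1 = 0 for the trivial all-ones row; the other rows get small errors.
errors = [0.0] + [rng.uniform(-0.1, 0.1) for _ in range(n - 1)]
r = [sum(P[v] for v in range(n) if C[i][v]) + errors[i] for i in range(n)]

mean_sq = sum(e * e for e in errors[1:]) / (n - 1)
bound = 4 * ((n - 1) / n) ** 2 * mean_sq
sq_losses = [(pecoc(C, r, y) - P[y]) ** 2 for y in range(n)]
```

For labels that are not in every subset, the errors enter with a sign determined by $C(i, y)$, which is why the check is stated for the squared error rather than for the signed sum.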



4 Online Tree Construction 

The analysis of Section 3.1.1 applies to any binary tree, and
motivates the creation of trees which have small depth and small
regret at the nodes. This leaves the question, "Which tree should we
use?" We give an online tree construction algorithm with several
useful properties. In particular, the algorithm doesn't require any
prior knowledge of the labels, and takes $O(\log n)$ computation per
example when there are $n$ labels. The algorithm guarantees a tree
with $O(\log n)$ maximum depth using a decision rule that trades off
between depth and ease of prediction.

4.1 Online Tree Building Algorithm 

Algorithm 3 builds and maintains a tree whose leaves are in one-to-one
correspondence with the labels seen so far. Each node $i$ in the tree
is associated with a regressor $f_i : X \to [0, 1]$. Given a new
sample $(x, y) \in X \times Y$, we consider two cases.

If $y$ already exists as a label of some leaf in the tree, then there
is an associated root-to-leaf path and we can use the conditional
probability tree algorithms of the previous section to train and test
on $(x, y)$, with one minor modification when training: we add a
regressor at the leaf and train it with the example $(x, 0)$.

If $y$ does not exist in the tree, then the algorithm still traverses
the tree to some leaf $j$, using a decision rule that computes a
direction (left or right) at each non-leaf node encountered. Once leaf
$j$ is reached, it necessarily corresponds to some label $y' \ne y$.
We convert $j$ to a non-leaf node with left child $y'$ and right child
$y$. The regressor at node $j$ is duplicated for $y'$. A new regressor
is created for $y$ and trained on the example $(x, 0)$.

We now describe the decision rule used to decide which way to go (left
or right) at each non-leaf node $i$ encountered during the traversal.
First, let $L_i$ denote the number of leaves in the left subtree of
node $i$, and $R_i$ the number in the right subtree. If
$f_i(x) > 1/2$, where $f_i(x)$ is the current prediction associated
with node $i$ on $x$, then the regressor favors the right subtree for
this input, and otherwise the left subtree. If the regressor favors
the side with the smaller number of elements, then this direction is
chosen. If the regressor favors the side with more elements, then the
algorithm faces a dilemma. On one hand, sending the new label to the
smaller side would result in a more balanced tree, but on the other
hand it would result in a training sample disagreeing with the current
regressor's prediction. Our resolution is to define an objective
function

$$\mathrm{obj}(p, L, R, \alpha) = (1 - \alpha) \, 2 \left(p - \tfrac{1}{2}\right) + \alpha \log_2 \frac{L}{R}$$



Algorithm 3: Online conditional probability tree (CPT) Training
(regression algorithm R, aggressiveness alpha)

create the root node r
foreach example (x, y) do
    if y has been seen previously then
        For each i in T(y), train f_i with (x, right_i(y)).
    else
        Set i = r.
        while i is not a leaf do
            if obj(f_i(x), L_i, R_i, alpha) > 0 then c = 1 (right)
            else c = 0 (left)
            Train f_i with example (x, c)
            Set i to the child of i corresponding to c
        Create children of leaf i:
            left with a copy of i (including f_i),
            right with label y trained on (x, 0).
        Train f_i with (x, 1).



and send the label to the right of node $i$ if

$$\mathrm{obj}(f_i(x), L_i, R_i, \alpha) > 0. \tag{5}$$

Here $\alpha$ is a free parameter set for the run of the entire
algorithm. When $\alpha = 1$, the rule indicates that we should place
new labels on the side with fewer current labels, resulting in a
perfectly balanced tree. When $\alpha = 0$, the direction chosen is
always the one currently favored by the regressor. A trade-off between
these two objectives is provided by values of $\alpha$ between these
two extremes.
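A direct transcription of the rule makes the two extremes easy to check. The sketch assumes the objective has the form $(1 - \alpha) \, 2 (p - 1/2) + \alpha \log_2 (L/R)$, which matches the behaviour described for $\alpha = 1$ and $\alpha = 0$; the helper `go_right` is a hypothetical name.

```python
import math

def obj(p, L, R, alpha):
    """Objective of rule (5); a positive value sends the new label right."""
    return (1 - alpha) * 2 * (p - 0.5) + alpha * math.log2(L / R)

def go_right(p, L, R, alpha):
    """Apply rule (5) at a node with prediction p and leaf counts L, R."""
    return obj(p, L, R, alpha) > 0

# alpha = 1: only balance matters, so the label goes to the smaller side.
balanced_right = go_right(0.0, L=5, R=2, alpha=1.0)
balanced_left = not go_right(1.0, L=2, R=5, alpha=1.0)
# alpha = 0: only the regressor matters, whatever the counts are.
follows_regressor = go_right(0.9, L=1, R=100, alpha=0.0)
# In between, a strong imbalance can override a confident regressor.
mixed = obj(0.9, L=1, R=4, alpha=0.5)
```

With $\alpha = 0.5$, a prediction of 0.9 contributes $+0.4$ while the count ratio $1/4$ contributes $-1$, so the label is sent left despite the regressor's preference.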

Pseudo-code is provided in Algorithm 3.
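For readers who prefer runnable code, here is a compact Python sketch of the online procedure. It is an illustration under stated assumptions, not the authors' implementation: the node regressor is a simple clipped linear model with a fixed 0.5 offset trained by a squared-loss gradient step, features are sparse dicts, the first label seen becomes a single-leaf root, and the class and method names are invented.

```python
import math

class Node:
    def __init__(self, label=None, weights=None):
        self.label = label                   # set for leaves only
        self.left = self.right = self.parent = None
        self.count = 1                       # number of leaves below this node
        self.w = dict(weights or {})         # sparse linear regressor f_i

    def predict(self, x):
        s = 0.5 + sum(self.w.get(k, 0.0) * v for k, v in x.items())
        return min(1.0, max(0.0, s))

    def train(self, x, target, lr=0.1):
        err = target - self.predict(x)
        for k, v in x.items():
            self.w[k] = self.w.get(k, 0.0) + lr * err * v

class OnlineCPT:
    def __init__(self, alpha=0.5):
        self.alpha, self.root, self.leaf = alpha, None, {}

    def _obj(self, p, L, R):
        return (1 - self.alpha) * 2 * (p - 0.5) + self.alpha * math.log2(L / R)

    def _path(self, leaf):
        """Yield (internal node, right_i(y)) pairs from the leaf up to the root."""
        node = leaf
        while node.parent is not None:
            yield node.parent, int(node.parent.right is node)
            node = node.parent

    def predict(self, x, y):
        if y not in self.leaf:
            return 0.0
        q = 1.0
        for node, c in self._path(self.leaf[y]):
            f = node.predict(x)
            q *= f if c else 1.0 - f
        return q

    def learn(self, x, y):
        if self.root is None:                # first label: a single-leaf tree
            self.root = self.leaf[y] = Node(label=y)
            return
        if y in self.leaf:                   # known label: train along its path
            self.leaf[y].train(x, 0)
            for node, c in self._path(self.leaf[y]):
                node.train(x, c)
            return
        i = self.root                        # new label: descend using rule (5)
        while i.left is not None:
            c = int(self._obj(i.predict(x), i.left.count, i.right.count) > 0)
            i.train(x, c)
            i.count += 1
            i = i.right if c else i.left
        old, new = Node(label=i.label, weights=i.w), Node(label=y)
        new.train(x, 0)
        old.parent = new.parent = i
        i.left, i.right, i.label, i.count = old, new, None, 2
        self.leaf[old.label], self.leaf[y] = old, new
        i.train(x, 1)

cpt = OnlineCPT(alpha=0.5)
stream = [({"a": 1.0}, "u"), ({"b": 1.0}, "v"), ({"c": 1.0}, "w"), ({"a": 1.0}, "u")] * 50
for x, y in stream:
    cpt.learn(x, y)
```

On this toy stream each feature determines its label, so after a few passes `cpt.predict({"a": 1.0}, "u")` is close to 1, and for any input the estimates over the three labels sum to 1.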

4.2 Online Tree Building Analysis 

In this section we analyze Algorithm 3. Throughout the section, for
any tree node under consideration, we will use $N$ for the total
number of leaves under the node, $L$ the number on the left and $R$ on
the right, with $L + R = N$. We note that rule (5) is symmetric with
respect to $L$ and $R$. We also define

$$\kappa = \frac{1}{1 + 2^{1 - 1/\alpha}}.$$

Claim 6 will establish that at most about a fraction $\kappa$ of the
leaves can fall on either side of a node, with $\kappa = 1/2$ for
$\alpha = 1$ and $\kappa \to 1$ as $\alpha \to 0$.

Claim 5. If a node has $L$ leaves in its left subtree, $R$ in the
right, and $N = L + R$ altogether, and if $R/N > \kappa$, then a new
leaf is added to the left subtree regardless of the prediction value
$p$ at the node (and symmetrically for $L$).



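Proof. For any $p \in [0, 1]$,

$$\mathrm{obj}(p, L, R, \alpha) \le (1 - \alpha) \, 2 \left(1 - \tfrac{1}{2}\right) + \alpha \log_2 \frac{L}{R} = (1 - \alpha) + \alpha \log_2 \frac{L}{R},$$

which is $< 0$ (forcing a leaf to be added to the left) if
$L/R < 2^{-(1-\alpha)/\alpha} = 2^{1 - 1/\alpha}$, or equivalently if
$R/N > \kappa$. □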
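The claim can be checked by brute force over small trees. The sketch assumes $\kappa = 1/(1 + 2^{1 - 1/\alpha})$ and the objective $(1 - \alpha) \, 2 (p - 1/2) + \alpha \log_2 (L/R)$, and evaluates the objective at the most right-leaning prediction $p = 1$; the grid of $\alpha$ values is arbitrary.

```python
import math

def kappa(alpha):
    """kappa = 1 / (1 + 2^(1 - 1/alpha)), for alpha in (0, 1]."""
    return 1.0 / (1.0 + 2.0 ** (1.0 - 1.0 / alpha))

def obj(p, L, R, alpha):
    """Objective of rule (5); a positive value sends the new label right."""
    return (1 - alpha) * 2 * (p - 0.5) + alpha * math.log2(L / R)

# Whenever R/N > kappa, even p = 1 must not send the new label right.
violations = []
for alpha in (0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    k = kappa(alpha)
    for N in range(2, 200):
        for R in range(1, N):
            L = N - R
            if R / N > k and obj(1.0, L, R, alpha) >= 0:
                violations.append((alpha, L, R))
```

An empty `violations` list is what the claim predicts; `kappa(1.0)` equals 1/2 and `kappa(alpha)` approaches 1 as `alpha` shrinks.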

Claim 6. Under any non-leaf node, $L, R \le \kappa N + (1 - \kappa)$.

Proof. We prove this inductively for $R$; the result for $L$ follows
symmetrically. A non-leaf node starts with one left and one right
child, and $R = L = 1$, $N = 2$ satisfies the claim. Given that $R$,
$L$, and $N$ satisfy the claim, we now prove that when a leaf is
added, so do the next values $R'$ (either $R$ or $R + 1$), $L'$
(respectively $L + 1$ or $L$), and $N' = N + 1$. There are two cases.
If $R \le \kappa N$ then

$$R' \le R + 1 \le \kappa N + 1 = \kappa (N' - 1) + 1 = \kappa N' + 1 - \kappa.$$

If $R > \kappa N$ then the next addition is to $L$ not $R$, and

$$R' = R \le \kappa N + 1 - \kappa < \kappa N' + 1 - \kappa.$$

□

Theorem 7. For all regressors at the nodes of the tree, for all
learning problems on $n$ labels, and for all $\alpha \in (0, 1]$, the
depth of the tree is at most $\log n / \log(1/\kappa) + 2$.

Proof. If the root node has $n$ leaves below it, then by the preceding
claim a child ("depth 1") of the root has at most
$\kappa n + (1 - \kappa)$ leaves, a grandchild has at most
$\kappa^2 n + \kappa (1 - \kappa) + (1 - \kappa)$ leaves, and a
depth-$d$ child has at most

$$\kappa^d n + \kappa^{d-1} (1 - \kappa) + \cdots + \kappa (1 - \kappa) + (1 - \kappa) < \kappa^d n + 1$$

leaves, using $\sum_{i=0}^{\infty} \kappa^i = 1/(1 - \kappa)$. With
$d = \lceil \ln n / \ln(1/\kappa) \rceil$, a depth-$d$ child has at
most 2 leaves, and thus further depth one, and we add one more to
account for the ceiling function. □
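A small simulation illustrates the depth bound. It tracks only leaf counts, replaces each regressor prediction by a uniform random number (an assumption made purely for the simulation), and assumes the objective and $\kappa$ take the forms $(1 - \alpha) \, 2 (p - 1/2) + \alpha \log_2 (L/R)$ and $1/(1 + 2^{1 - 1/\alpha})$.

```python
import math
import random

def kappa(alpha):
    return 1.0 / (1.0 + 2.0 ** (1.0 - 1.0 / alpha))

def new_node():
    # A child entry of None stands for a leaf.
    return {"L": 1, "R": 1, "left": None, "right": None}

def simulate_depth(n, alpha, rng):
    """Insert n labels with rule (5) and random predictions; return the max depth."""
    root, max_depth = None, 0
    for _ in range(n - 1):                   # the first label is the initial leaf
        if root is None:
            root, max_depth = new_node(), 1
            continue
        node, depth = root, 0
        while True:
            p = rng.random()                 # stand-in for the prediction f_i(x)
            score = (1 - alpha) * 2 * (p - 0.5) + alpha * math.log2(node["L"] / node["R"])
            side, cnt = ("right", "R") if score > 0 else ("left", "L")
            node[cnt] += 1
            depth += 1
            if node[side] is None:           # reached a leaf: split it in two
                node[side] = new_node()
                max_depth = max(max_depth, depth + 1)
                break
            node = node[side]
    return max_depth

rng = random.Random(0)
n = 2000
results = {}
for alpha in (0.2, 0.5, 0.8, 1.0):
    bound = math.log(n) / math.log(1 / kappa(alpha)) + 2
    results[alpha] = (simulate_depth(n, alpha, rng), bound)
```

Each entry of `results` pairs the observed maximum depth with the bound of Theorem 7; smaller values of `alpha` loosen the bound because the regressor is given more say over the tree shape.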

Definition 8. A disagreement is the event when a 
new label reaches a node, and the algorithm decides 
to insert it in the subtree that is not preferred by the 
regressor. 

That is, a disagreement occurs when the regressor's 
prediction is at most 1/2 and the label is inserted to 
the right, or when the prediction is greater than 1/2 
and the label is inserted to the left. 

Note that the number of disagreements incurred when 
adding a new label (leaf) is at most the depth of that 
leaf, and as the tree evolves the "same" leaf (per the 
copying rule of the algorithm) may become deeper but 



never shallower. Thus the total number of disagree- 
ments incurred in building a tree is at most the sum 
of the depths of all leaves of the final tree. 

To get a grasp on this quantity, for simplicity we disregard the
additive $1 - \kappa$ in Claim 6 coming from adding vertices
discretely, one at a time. (The effect is most dramatic when a node
has just two children, $L = R = 1$, and adding a leaf necessarily
produces a lopsided tree with $L = 1$ and $R = 2$ or vice versa. For
large values of $L + R = N$ the effect of discretization is
negligible.)

As usual, for a node in a tree let $L$ be the number of leaves in its
left subtree, $R$ in the right, and $N = L + R$.

Theorem 9. Let $T$ be an $n$-leaf binary tree in which for each node,
$L, R \le \kappa N$. Then the total of the depths of the leaves of $T$
is at most $d(n) = n \log n / H(\kappa)$, where
$H(\kappa) = -\kappa \log \kappa - (1 - \kappa) \log(1 - \kappa)$.

Proof. The proof is by induction on $n$, starting from the base case
$n = 2$ where the total of the depths (or total depth for short) is 2.
It is well known that the entropy function $H(\kappa)$ is maximized by
$H(1/2) = \log 2$, so in the base case we do indeed have $2 \le d(n)$
since $d(n) \ge 2 \log 2 / \log 2 = 2$.

Proceeding inductively, the total depth for an $N$-leaf tree with
$L$- and $R$-leaf subtrees is the total depth of $L$ (at most $d(L)$),
plus the total depth of $R$ (at most $d(R)$), plus $N$ (since each
leaf is 1 deeper in the full tree). Since $d(\cdot)$ is a convex
function, the worst case comes from the most unequal split, and
applying the inductive hypothesis, the total depth for $N$ is at most

\begin{align*}
N + d(\kappa N) + d((1 - \kappa) N)
&\le N + \frac{\kappa N \log(\kappa N)}{H(\kappa)} + \frac{(1 - \kappa) N \log((1 - \kappa) N)}{H(\kappa)} \\
&= N + \frac{N}{H(\kappa)} \bigl( \kappa \log \kappa + \kappa \log N + (1 - \kappa) \log(1 - \kappa) + (1 - \kappa) \log N \bigr) \\
&= N + \frac{N}{H(\kappa)} \bigl( -H(\kappa) + \log N \bigr) \\
&= N \log N / H(\kappa) \\
&= d(N),
\end{align*}

completing the proof that $d(N)$ is an upper bound. □
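The algebra of the inductive step can be verified numerically: with $d(n) = n \log n / H(\kappa)$, the quantity $N + d(\kappa N) + d((1 - \kappa) N)$ collapses back to $d(N)$. The sketch below checks this identity for a few values; the function names and test values are illustrative.

```python
import math

def entropy(kappa):
    """H(kappa) = -kappa log kappa - (1 - kappa) log(1 - kappa), natural log."""
    return -kappa * math.log(kappa) - (1 - kappa) * math.log(1 - kappa)

def d(n, kappa):
    """Upper bound on total leaf depth: n log n / H(kappa)."""
    return n * math.log(n) / entropy(kappa)

def inductive_step(N, kappa):
    """Total depth bound after splitting N leaves as kappa N and (1 - kappa) N."""
    return N + d(kappa * N, kappa) + d((1 - kappa) * N, kappa)

checks = [
    (N, k, inductive_step(N, k), d(N, k))
    for k in (0.5, 0.6, 0.75, 0.9)
    for N in (10, 1000)
]
balanced_total = d(8, 0.5)   # a perfect binary tree with 8 leaves has total depth 24
```

At $\kappa = 1/2$ the bound is attained exactly by a perfectly balanced tree, since $n \ln n / \ln 2 = n \log_2 n$.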
4.3 Experiments 

We conducted experiments on two datasets. The purpose of the first
experiment is to show that the conditional probability tree (CPT)
competes in prediction performance with existing exponentially slower
approaches. To do this, we derive a label probability prediction
problem from the publicly available Reuters RCV1 dataset [6]. The
second experiment is a full-scale test of the system where an
exponentially slower approach is too slow to seriously consider. We
use a proprietary dataset that consists of webpages and associated
advertisements, where the derived problem is to predict the
probability that an ad would be displayed on the webpage.

Each dataset was split into a training and test set. Each training or
test sample is of the form $(x, y)$. The algorithms train on the
training set and produce a probabilistic rule $f(\cdot, \cdot)$ that
maps pairs of the form $(x, y)$ to numbers in the range $[0, 1]$,
where we interpret $f(x, y)$ as an approximation to $P(y \mid x)$. The
algorithms are evaluated on the test set by computing the empirical
squared loss, $\sum_{(x, y)} (1 - f(x, y))^2$. The algorithms are
allowed to continue learning as they are tested; however, the
predictions $f(x, y)$ used above are computed before training on the
sample $(x, y)$. This type of evaluation is called "progressive
validation" [1] and accurately measures the performance of an online
algorithm. In particular, it is an unbiased estimate of the
algorithm's performance under the assumption that the $(x, y)$ pairs
are identically and independently distributed. In the motivating
applications of our algorithm, we expect new labels to appear
throughout the learning process, which requires learning to occur
continually in an online fashion. Thus, turning learning off and
computing a "test loss" is less natural. Nevertheless, for the Reuters
dataset, we verified that the test loss and progressive validation are
quite similar. For the web advertising dataset, the two measures were
drastically different (all methods performed much worse under test
loss), due to the large number of labels that appear only in the test
set.
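The evaluation protocol is simple to express in code: predict on each sample first, then train on it. The sketch below uses a toy count-based estimator as a stand-in for the CPT and reports the average of $(1 - f(x, y))^2$; the class and function names are invented for illustration.

```python
from collections import Counter, defaultdict

class FrequencyEstimator:
    """Toy online estimator of P(y | x) from running counts (stand-in for CPT)."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def predict(self, x, y):
        seen = sum(self.counts[x].values())
        return self.counts[x][y] / seen if seen else 0.0

    def learn(self, x, y):
        self.counts[x][y] += 1

def progressive_validation(model, stream):
    """Average (1 - f(x, y))^2, predicting on each sample before training on it."""
    total, m = 0.0, 0
    for x, y in stream:
        total += (1.0 - model.predict(x, y)) ** 2   # evaluate first
        model.learn(x, y)                           # then train
        m += 1
    return total / m

stream = [("page", "ad_a"), ("page", "ad_a"), ("page", "ad_b"), ("page", "ad_a")]
loss = progressive_validation(FrequencyEstimator(), stream)
```

On this four-sample stream the per-sample losses are 1, 0, 1 and 1/9, so a label that has never been seen before always incurs the maximum loss of 1 at its first appearance.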

The CPT algorithm was executed with three tree-building construction
methods: a random tree where uniform random left/right decisions were
made until a leaf was encountered, a balanced tree according to
Algorithm 3 with $\alpha = 1$, and a general tree according to
Algorithm 3 with $\alpha < 1$. For the binary regression problems (at
the nodes), we used Vowpal Wabbit [5], which is a simple linear
regressor trained by stochastic gradient descent. One essential
enabling feature of VW is a hashing trick (described in [11, 12])
which allows us to represent 1.7M linear regressors on a sparse
feature space in a reasonable amount of RAM.

4.3.1 Reuters RCV1

The Reuters dataset consists of about 800K documents, each assigned to
one or more categories. A total of approximately 100 categories appear
in the data. We split the data into a training set of 780K documents
and a test set of 20K documents, reversing the roles of its original
split. For each document doc, we formed an example of the form
$(x, y)$, as follows. The vector $x$ uses a "bag of words"
representation of doc, weighted by the normalized TF-IDF scores,
exactly as done in the paper [6]. The label $y$ is one of the
categories assigned to doc, chosen uniformly at random if more than
one category was assigned to doc.

We compared the CPT to the one-against-all algorithm, a standard
approach for reducing multi-class regression to binary regression. The
one-against-all approach regresses on the probability of each category
$c$ versus all other categories. Given a base training example
$(x, y)$, the example used to train the regressor $f_c$ for category
$c$ is $(x, I[y = c])$, where $I[\cdot]$ is the indicator function.
Predictions for a new test example $(x, y)$ are done according to
$f_y(x)$. The learning algorithm used for training the binary
regressors in both approaches was incremental gradient descent with
squared loss. For each algorithm, we ran several versions with
different learning rates, chosen from a coarse grid, and picked the
setting that yielded the smallest training error. For the CPT
algorithm, we performed a similar search over $\alpha$.

The one-against-all approach used one pass over the training data,
while the CPT used two passes. Note that even with an additional pass,
the CPT is much faster than one-against-all for training, due to the
fact that CPT requires training only about
log(number of categories) = log(103) regressors (nodes in the tree)
per example, whereas one-against-all trains one regressor per
category. On our machine, the CPT took 108 seconds to train, while
one-against-all took 2300 seconds. We use progressive validation [1]
to compute an average squared loss over the test set with results
appearing in the following table, where the confidence intervals are
computed by Hoeffding's inequality [3] with $\delta = 0.05$.



Method                                      Squared loss
One-against-all                             0.55 ± 0.012
CPT with a random tree                      0.56 ± 0.012
CPT with a balanced tree                    0.56 ± 0.012
CPT with an online tree ($\alpha = 0.6$)    0.56 ± 0.012



The values are indeed mostly identical, but CPT 
achieved this performance with an order of magnitude 
less computation. 

Note that in this problem, there is not much advantage 
in using our algorithm over using a random tree. Since 
there aren't many labels and there are many examples, 
the structure of the tree is not very important. This 
is confirmed by running the algorithm with various 
different random trees and observing little variability 
in squared loss. 



4.3.2 Web Advertising 
We used a proprietary dataset consisting of about 50M pairs of
webpages and associated advertisements that were shown on the webpage.
There are about 5.8M unique webpages and 860K unique ads in the
dataset. The most frequent ad appeared in approximately 1.2% of the
cases. The events were split into a training set of size 40M and a
test set of size 10M in time order. Note that webpages and ads both
appear multiple times in the training and test sets. For each event,
where an event consisted of a single ad being shown on a single
webpage, we create a sample $(x, y)$, where $x$ is a "bag of words"
vector representation of the webpage, and $y$ is a unique ID
associated with the advertisement displayed. The learning problem is
predicting $P(y \mid x)$, the probability that the logging policy
displays advertisement $y$ given webpage $x$. Since $n$ is large,
one-against-all would be extremely slow. The running time for our
algorithm on this dataset was about 60 minutes. Multiplying by
$860\mathrm{K} / \log_2(860\mathrm{K})$ suggests a running time for
one-against-all of about 5 years.
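The extrapolation is plain arithmetic and can be reproduced as follows, taking 860K labels and a 60-minute CPT run from the text and assuming 365-day years.

```python
import math

n_labels = 860_000                           # unique ads (labels)
cpt_minutes = 60                             # reported CPT running time
speedup = n_labels / math.log2(n_labels)     # ratio of n to log2(n)
oaa_years = cpt_minutes * speedup / (60 * 24 * 365)
```

The ratio comes out to roughly 43,600, so the projected one-against-all time is roughly 43,600 hours.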

Besides the three versions of CPT described above, we tested one other
method we call the "table-based" method. In the table-based method, we
simply predict $P(y \mid x)$ by the empirical frequency with which ad
$y$ was displayed on webpage $x$ in the training set. The progressive
validation [1] results of the four algorithms over the test set appear
in the following table with confidence intervals again computed using
Hoeffding's bound for $\delta = 0.05$.



Method                           Squared loss         Equivalent
Table                            0.812 ± 0.00055      10.11
Random tree                      0.7742 ± 0.00055     8.32
Balanced tree                    0.7725 ± 0.00055     8.25
Online tree ($\alpha = 0.9$)     0.7632 ± 0.00055     7.91
Best possible                    0.665                5.42



Here, the "Equivalent" column is the number of la- 
bels for which a uniform random process produces the 
same loss. The "Best possible" line is an unachievable 
bound on performance found by examining the empir- 
ical frequency of ad-webpage pairs in the test set. 
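One reading that reproduces the "Equivalent" column is the following: a uniform process over $m$ labels predicts $1/m$ for the observed label and so incurs squared loss $(1 - 1/m)^2$; solving for $m$ gives $m = 1 / (1 - \sqrt{\text{loss}})$. This interpretation is an assumption made here, not stated in the text, and the sketch simply compares it with the reported values.

```python
import math

def equivalent_labels(loss):
    """Number m of labels such that predicting 1/m everywhere has this squared loss."""
    return 1.0 / (1.0 - math.sqrt(loss))

# Squared loss -> reported "Equivalent" value, copied from the table.
reported = {0.812: 10.11, 0.7742: 8.32, 0.7725: 8.25, 0.7632: 7.91, 0.665: 5.42}
computed = {loss: equivalent_labels(loss) for loss in reported}
```

Under this reading the computed values agree with the table to within about 0.01.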

The magnitude of squared loss improvement is modest, but substantial
enough to be useful. Since many of the webpages are seen many times,
the conditional distribution over ads can be approximated well by
empirical frequencies. Thus, the table-based method forms a strong
baseline. A small but significant fraction of the webpages were seen
only a few times, and for these webpages, it was necessary to
generalize (predict which ads would appear based on which ads appeared
on pages similar to the current one). On these examples, the tree
performed substantially better.